cshea26_OriginalHomeworkCode_03
#First load the data (csv)
zombies <- read.csv("https://raw.githubusercontent.com/fuzzyatelin/fuzzyatelin.github.io/refs/heads/master/AN588_Spring25/zombies.csv", header = TRUE, stringsAsFactors = FALSE)
head(zombies) #shortening reading frame to something more manageable; last time i tried doing this i realized my data didn't load correctly so doing this step as a check instead of calling all of the data ## id first_name last_name gender height weight zombies_killed
## 1 1 Sarah Little Female 62.88951 132.0872 2
## 2 2 Mark Duncan Male 67.80277 146.3753 5
## 3 3 Brandon Perez Male 72.12908 152.9370 1
## 4 4 Roger Coleman Male 66.78484 129.7418 5
## 5 5 Tammy Powell Female 64.71832 132.4265 4
## 6 6 Anthony Green Male 71.24326 152.5246 1
## years_of_education major age
## 1 1 medicine/nursing 17.64275
## 2 3 criminal justice administration 22.58951
## 3 1 education 21.91276
## 4 6 energy studies 18.19058
## 5 3 logistics 21.10399
## 6 4 energy studies 21.48355
Question 1.1
Calculate the population mean and standard deviation for each quantitative random variable (height, weight, age, number of zombies killed, and years of education). NOTE: You will not want to use the built in var() and sd() commands as these are for samples.
#shortening mean to m for creating abbreviations to determine population means
summary(zombies) #one way to see mean values for all of the different columns## id first_name last_name gender
## Min. : 1.0 Length:1000 Length:1000 Length:1000
## 1st Qu.: 250.8 Class :character Class :character Class :character
## Median : 500.5 Mode :character Mode :character Mode :character
## Mean : 500.5
## 3rd Qu.: 750.2
## Max. :1000.0
## height weight zombies_killed years_of_education
## Min. :54.15 Min. : 90.29 Min. : 0.000 Min. :0.000
## 1st Qu.:64.68 1st Qu.:131.81 1st Qu.: 2.000 1st Qu.:2.000
## Median :67.50 Median :142.89 Median : 3.000 Median :3.000
## Mean :67.63 Mean :143.91 Mean : 2.992 Mean :2.996
## 3rd Qu.:70.38 3rd Qu.:156.28 3rd Qu.: 4.000 3rd Qu.:4.000
## Max. :80.53 Max. :210.79 Max. :11.000 Max. :8.000
## major age
## Length:1000 Min. :10.66
## Class :character 1st Qu.:18.07
## Mode :character Median :19.90
## Mean :20.05
## 3rd Qu.:21.94
## Max. :29.59
Question 1.2
Other way to calculate the population mean for each quantitative random variable
mean.height <- mean(zombies$height)
mean.weight <- mean(zombies$weight)
mean.age <- mean(zombies$age)
mean.zombieskilled <- mean(zombies$zombies_killed)
mean.eduation <- mean(zombies$years_of_education)
names <- c("height", "weight", "age", "zombies killed", "years of education")
stat_array <- c(mean.height, mean.weight, mean.age, mean.zombieskilled, mean.eduation)
samp_stats <- matrix(stat_array, nrow=5, ncol=1, byrow=TRUE)
rownames(samp_stats) <- names
colnames(samp_stats) <- "Mean"
samp_stats## Mean
## height 67.63010
## weight 143.90748
## age 20.04696
## zombies killed 2.99200
## years of education 2.99600
Question 1.3
Calculating the standard deviation for each of the quantitative random variables
pop_sd <- function(x) {
sqrt(sum((x - mean(x))^2)/(length(x))) #standard deviation is taking the square root of the population variance here, i could have made it into a separate function for population variance but didn't feel like it
}
#changes Now After Jon's Comments
sd_height = pop_sd(zombies$height)
sd_weight = pop_sd(zombies$weight)
sd_age = pop_sd(zombies$age)
sd_zk = pop_sd(zombies$zombies_killed)
sd_yoe = pop_sd(zombies$years_of_education)
#plotting the inputs above in an array
stat_array <- c(sd_height, sd_weight, sd_age, sd_zk, sd_yoe)
samp_stats <- matrix(stat_array, nrow=5, ncol=1, byrow=TRUE)
rownames(samp_stats) <- names
colnames(samp_stats) <- "Standard Deviation"
samp_stats## Standard Deviation
## height 4.307970
## weight 18.391857
## age 2.963583
## zombies killed 1.747551
## years of education 1.675704
Question 2
Use {ggplot} to make boxplots of each of these variables by gender.
#setting colors
colors <- c("#26bf3b", "#1d9bd1", "#d11d89")
#height
plot.height <- ggplot(data = zombies, aes(x = gender, y = height, fill = gender)) +
geom_boxplot(na.rm = TRUE, show.legend = FALSE) +
xlab("Gender") +
ylab("Height") +
theme_bw() +
scale_fill_manual(values = colors)
plot.height <- plot.height + ggtitle("Height") + theme(plot.title = element_text(hjust = 0.5))
#adding title to the plot
#hjust centers it by setting it equal to .5
#weight
plot.weight <- ggplot(data = zombies, aes(x = gender, y = weight, fill = gender)) +
geom_boxplot(na.rm = TRUE, show.legend = FALSE) +
xlab("Gender") +
ylab("Weight") +
theme_bw() +
scale_fill_manual(values = colors)
plot.weight <- plot.weight + ggtitle("Weight") + theme(plot.title = element_text(hjust = 0.5))
#age
plot.age <- ggplot(data = zombies, aes(x = gender, y = age, fill = gender)) +
geom_boxplot(na.rm = TRUE, show.legend = FALSE) +
xlab("Gender") +
ylab("Age") +
theme_bw() +
scale_fill_manual(values = colors)
plot.age <- plot.age + ggtitle("Age") + theme(plot.title = element_text(hjust = 0.5))
#zombies killed
plot.zombies_killed <- ggplot(data = zombies, aes(x = gender, y = zombies_killed, fill = gender)) +
geom_boxplot(na.rm = TRUE, show.legend = FALSE) +
xlab("Gender") +
ylab("Zombies Killed") +
theme_bw() +
scale_fill_manual(values = colors)
plot.zombies_killed <- plot.zombies_killed + ggtitle("Zombies Killed") + theme(plot.title = element_text(hjust = 0.5))
#years of education
plot.years_of_education <- ggplot(data = zombies, aes(x = gender, y = years_of_education, fill = gender)) +
geom_boxplot(na.rm = TRUE, show.legend = FALSE) +
xlab("Gender") +
ylab("Years of Education") +
theme_bw() +
scale_fill_manual(values = colors)
plot.years_of_education <- plot.years_of_education + ggtitle("Years of Education") + theme(plot.title = element_text(hjust = 0.5))
grid.arrange(plot.height, plot.weight, plot.age,
plot.zombies_killed, plot.years_of_education, nrow = 2, ncol = 3) #doing this to make them all sit together on one page as 5 boxplots (similar to par function - i install the gridExtra package for this)Question 3
Use {ggplot} to make scatterplots of height and weight in relation to age. Do these variables seem to be related? In what way?
scat.height <- ggplot(data = zombies, aes(x = age, y = height, color = age)) #making plot
scat.height <- scat.height + scale_color_gradient(low = "blue", high = "green") #adding different colors to differentiate between the two scatterplots
scat.height <- scat.height + xlab("Age") + ylab("Height") + theme_bw() #adding labels for axes
scat.height <- scat.height + geom_point() #making scatterplot!
scat.heightscat.weight <- ggplot(data = zombies, aes(x = age, y = weight, color = age)) #making plot
scat.weight <- scat.weight + scale_color_gradient(low = "pink", high = "purple")
scat.weight <- scat.weight + xlab("Age") + ylab("Weight") + theme_bw() #adding labels for axes
scat.weight <- scat.weight + geom_point() #making scatterplot!
scat.weight#i want to combine them on one page to visualize the data together
grid.arrange(scat.height, scat.weight, nrow = 1, ncol = 2)
Takeaways From Generating Two Scatterplots: Identifying Trends * I
notice that when plotting Age vs. Weight and Age vs. Height, there
appear to be positive correlations in both plots. Younger individuals
are shown to be shorter and lower in weight. Differences across the two
plots highlights greater variability in the Age vs. Weight, where there
is less of a linear correlation (varies a lot)
Question 4.1
Using histograms check whether the quantitative variables seem to be drawn from a normal distribution. Which seem to be and which do not (hint: not all are drawn from the normal distribution)? For those that are not normal, can you determine from which common distribution they are drawn?
par(mfrow = c(2, 3))
hist(zombies$height, main = "Height", xlab="Height", col= "azure", border= "black")
hist(zombies$weight, main = "Weight", xlab="Weight", col= "skyblue", border= "black")
hist(zombies$age, main = "Age", xlab= "Age", col="cornflowerblue", border = "black")
hist(zombies$zombies_killed, main = "Number of Zombies Killed", xlab= "Number of Zombies Killed", col="cyan", border= "black")
hist(zombies$years_of_education, main = "Years of Education", xlab= "Years of Education", col="lightblue", border= "black")
Based on the histograms I created above, the Height, Weight, and Age
categories all follow normal distributions. The other two histograms for
Number of Zombies Killed and Years of Education show a tail extending
toward the right side, indicating they are not normal (possibly
Poisson).
Question 4.2
Using Q-Q plots check whether the quantitative variables seem to be drawn from a normal distribution. Which seem to be and which do not (hint: not all are drawn from the normal distribution)? For those that are not normal, can you determine from which common distribution they are drawn?
#Looking at Q-Q plots now
par(mfrow=c(2,3))
qqnorm(zombies$height, main = "Height")
qqline(zombies$height, col = "gray")
qqnorm(zombies$weight, main = "Weight")
qqline(zombies$weight, col = "gray")
qqnorm(zombies$age, main = "Age")
qqline(zombies$age, col = "gray")
qqnorm(zombies$zombies_killed, main = "Zombies Killed")
qqline(zombies$zombies_killed, col = "gray")
qqnorm(zombies$years_of_education, main = "Years of Education")
qqline(zombies$years_of_education, col = "gray")
Again we see here that the Years of Education and Number of Zombies
Killed are not following the normal distribution for a Q-Q plot I think
these follow the Poisson Distribution instead (which makes sense because
a Poisson would be right skewed when the mean is lower).
Question 5.1
Now use the sample() function to sample ONE subset of 30 zombie survivors (without replacement) from this population and calculate the mean and sample standard deviation for each variable. Also estimate the standard error for each variable, and construct the 95% confidence interval for each mean. Note that for the variables that are not drawn from the normal distribution, you may need to base your estimate of the CI’s on slightly different code than for the normal…
set.seed(123) #i suppose i could also make a function here?
#FOR HEIGHT
s.height <- sample(zombies$height, 30, replace = FALSE)
mean(s.height)## [1] 67.63939
## [1] 3.530531
# time to figure out upper and lower bounds for the 95% CI
sem.height <- sd(s.height)/sqrt(length(s.height))
sem.height## [1] 0.6445838
#figuring out confidence intervals based on that standard error of the mean sem
h.lower <- mean(s.height) - qnorm(1 - 0.05/2) * sem.height
h.upper <- mean(s.height) + qnorm(1 - 0.05/2) * sem.height
ci.height <- c(h.lower, h.upper)
ci.height## [1] 66.37603 68.90275
Question 5.2
## [1] 146.4101
## [1] 14.64095
## [1] 2.673059
w.lower <- mean(s.weight) - qnorm(1 - 0.05/2) * sem.weight
w.upper <- mean(s.weight) + qnorm(1 - 0.05/2) * sem.weight
ci.weight <- c(w.lower, w.upper)
ci.weight## [1] 141.1710 151.6492
Question 5.3
#FOR AGE
set.seed(123)
s.age <- sample(zombies$age, 30, replace = FALSE)
mean.age <- mean(s.age)
sd.age <- (s.age)
sem.age <- sd(s.age)/sqrt(length(s.age))
sem.age #standard error## [1] 0.466644
a.lower <- mean(s.age) - qnorm(1 - 0.05/2) * sem.age
a.upper <- mean(s.age) + qnorm(1 - 0.05/2) * sem.age
ci.age <- c(a.lower, a.upper)
ci.age## [1] 18.41466 20.24387
Question 5.4
#FOR ZOMBIES KILLED
set.seed(123)
s.zombies_killed <- sample(zombies$zombies_killed, 30, replace = FALSE)
mean_killed <- mean(s.zombies_killed)
sd_killed <- sd(s.zombies_killed)
sem.zombies_killed <- sd(s.zombies_killed)/sqrt(length(s.zombies_killed))
sem.zombies_killed #standard error## [1] 0.3097397
z.lower <- mean(s.zombies_killed) - qnorm(1 - 0.05/2) * sem.zombies_killed
z.upper <- mean(s.zombies_killed) + qnorm(1 - 0.05/2) * sem.zombies_killed
ci.zombies_killed <- c(z.lower, z.upper)
ci.zombies_killed## [1] 1.926255 3.140412
Quesion 5.5
#FOR YEARS OF EDUCATION
set.seed(123)
s.years_of_education <- sample(zombies$years_of_education, 30, replace = FALSE)
mean(s.years_of_education)## [1] 3.1
## [1] 1.422722
sem.years_of_education <- sd(s.years_of_education)/sqrt(length(s.years_of_education))
sem.years_of_education #standard error## [1] 0.2597523
y.lower <- mean(s.years_of_education) - qnorm(1 - 0.05/2) * sem.years_of_education
y.upper <- mean(s.years_of_education) + qnorm(1 - 0.05/2) * sem.years_of_education
ci.years_of_education <- c(y.lower, y.upper)
ci.years_of_education## [1] 2.590895 3.609105
#making an array to display all of these samples cleanly
summary_table <- data.frame(
Category = c("Height (cm)", "Weight (kg)", "Age (years)", "Zombies Killed", "Years of Education"),
`Mean` = c(mean(s.height), mean(s.weight), mean(s.age), mean(s.zombies_killed), mean(s.years_of_education)),
`Standard Deviation (SD)` = c(sd(s.height), sd(s.weight), sd(s.age), sd(s.zombies_killed), sd(s.years_of_education)),
`Standard Error of Mean (SEM)` = c(sem.height, sem.weight, sem.age, sem.zombies_killed, sem.years_of_education),
`Confidence Interval (Lower)` = c(h.lower, w.lower, a.lower, z.lower, y.lower),
`Confidence Interval (Upper)` = c(h.upper, w.upper, a.upper, z.upper, y.upper)
)
print(summary_table)## Category Mean Standard.Deviation..SD.
## 1 Height (cm) 67.639389 3.530531
## 2 Weight (kg) 146.410075 14.640945
## 3 Age (years) 19.329267 2.555914
## 4 Zombies Killed 2.533333 1.696514
## 5 Years of Education 3.100000 1.422722
## Standard.Error.of.Mean..SEM. Confidence.Interval..Lower.
## 1 0.6445838 66.376028
## 2 2.6730586 141.170977
## 3 0.4666440 18.414661
## 4 0.3097397 1.926255
## 5 0.2597523 2.590895
## Confidence.Interval..Upper.
## 1 68.902750
## 2 151.649174
## 3 20.243872
## 4 3.140412
## 5 3.609105
Question 6.1
Now draw 99 more random samples of 30 zombie apocalypse survivors, and calculate the mean for each variable for each of these samples. Together with the first sample you drew, you now have a set of 100 means for each variable (each based on 30 observations), which constitutes a sampling distribution for each variable.
set.seed(123)
sample_size <- 30
num_samples <- 100
means_height <- numeric(num_samples) #making a vector to store the means
means_weight <- numeric(num_samples)
means_age <- numeric(num_samples)
means_zombies_killed <- numeric(num_samples)
means_years_of_education <- numeric(num_samples)
for (i in 1:100) {
s.height <- sample(zombies$height, sample_size, replace = FALSE)
s.weight <- sample(zombies$weight, sample_size, replace = FALSE)
s.age <- sample(zombies$age, sample_size, replace = FALSE)
s.zombies_killed <- sample(zombies$zombies_killed, sample_size, replace = FALSE)
s.years_of_education <- sample(zombies$years_of_education, sample_size, replace = FALSE)
means_height[i] <- mean(s.height)
means_weight[i] <- mean(s.weight)
means_age[i] <- mean(s.age)
means_zombies_killed[i] <- mean(s.zombies_killed)
means_years_of_education[i] <- mean(s.years_of_education)
}## [1] 67.63939 68.36121 69.07532 68.32492 66.32369 66.93042 66.53392 68.70451
## [9] 67.27397 65.95768 68.07684 67.51131 67.14272 67.31903 67.89305 67.33863
## [17] 68.73414 68.79888 67.77504 67.78526 65.75114 66.87278 69.03289 68.60095
## [25] 68.00720 66.63878 67.10774 67.13216 68.35493 68.14400 66.56230 67.92487
## [33] 68.76462 67.45579 68.49332 67.62000 67.40862 67.61967 66.04683 66.60914
## [41] 66.15240 67.77779 67.91943 67.24091 67.01702 67.69055 66.64393 68.22686
## [49] 67.71938 67.52770 66.31294 67.22008 66.89954 68.31277 68.64001 66.90464
## [57] 65.49975 67.06253 68.30256 67.13342 67.62964 67.04789 67.16701 67.21979
## [65] 67.86671 67.85763 68.03994 67.23991 67.24076 67.57879 66.38582 67.69028
## [73] 67.58948 67.21146 66.83483 67.57242 67.60684 67.09717 67.50784 67.34585
## [81] 67.78175 67.86136 69.38707 66.82952 67.63564 66.47633 67.49770 67.34487
## [89] 67.42950 66.99378 67.37500 68.65372 68.08114 67.39915 67.31008 68.22363
## [97] 68.20876 67.79679 67.34017 69.01456
Question 6.2
What are the means and standard deviations of this distribution of means for each variable? How do the standard deviations of means compare to the standard errors estimated in [5]?
#now trying to figure out what the means are of for 100 replicates
sampling_means <- data.frame(
Variable = c("Height", "Weight", "Age", "Zombies Killed", "Years of Education"),
Mean_of_Sampling_Distribution = c(
mean(means_height),
mean(means_weight),
mean(means_age),
mean(means_zombies_killed),
mean(means_years_of_education)
),
SD_of_Sampling_Distribution = c(
sd(means_height),
sd(means_weight),
sd(means_age),
sd(means_zombies_killed),
sd(means_years_of_education)
),
SEM_of_Sampling_Distribution = c(
sd(means_height),
sd(means_weight),
sd(means_age),
sd(means_zombies_killed),
sd(means_years_of_education)
)
)
#here for sem, we don't need to divide again by the square root of 100 because we already have created a distribution of means (which is something i didn't realize before)
#here, the standard error represents the true standard deviation
sampling_means## Variable Mean_of_Sampling_Distribution SD_of_Sampling_Distribution
## 1 Height 67.522224 0.7599851
## 2 Weight 143.569940 3.2293293
## 3 Age 20.031425 0.5428518
## 4 Zombies Killed 2.952667 0.3399076
## 5 Years of Education 3.010000 0.2865288
## SEM_of_Sampling_Distribution
## 1 0.7599851
## 2 3.2293293
## 3 0.5428518
## 4 0.3399076
## 5 0.2865288
Question 6.3
What do these sampling distributions look like (a graph might help here)? Are they normally distributed? What about for those variables that you concluded were not originally drawn from a normal distribution?
par(mfrow=c(2,3))
hist(means_height, main = "Histogram of Sampled Heights", xlab = "Mean Height", col="azure" , border = "black")
hist(means_weight, main = "Histogram of Sampled Weights", xlab = "Mean Weight", col="skyblue" , border = "black")
hist(means_age, main = "Histogram of Sampled Ages", xlab = "Mean Age", col="cornflowerblue" , border = "black")
hist(means_zombies_killed, main = "Histogram of Zombies Killed", xlab = "Mean Zombies Killed", col="cyan" , border = "black")
hist(means_years_of_education, main = "Histogram of Years of Education", xlab = "Mean Years of Education", col="lightblue" , border = "black")Overall, the replicated data when creating 100 different means is very similar to the original sample of means that we calculated in question 4. The standard deviations are much lower when calculating them now after generating 100 different means for each variable. I suppose this makes sense as an increase in data sets will lead to a decrease in deviation from the mean. As for the SEM, it remains pretty much the same as it was for the estimated prediciton of the SEM based on one sample. The difference in calcuating now when we have 100 iterations generated is that we no longer divide by the square root of the length of the sample means (as we already have created a distribution of means and it represents the true standard error) As it is now, it looks like the SEM also decreased. Having so many distributions now makes the histograms for the Zombies_Killed and Years_of_Education look a little more normally distributed now too!
5 Challenges I Faced
#1. Initially I struggled to add a toc because it wouldn’t let me knit the file when I was using the rmd formats package. I got this message: formal argument “toc” matched by multiple actual arguments. I deleted it and I suppose now it has a built in toc and therefore I don’t need to specify. Also I am sorry but this is lowkey an ugly theme I can’t lie.
#2. I struggled with loading my data using the curl function but after looking at the previous modules I realized that I needed to use quotation marks surrounding the link to the data set, oops!
#3.^ I ended up abandoning the use of the curl function because I thought that was the original issue (turns out that I didn’t copy the url correctly and that I forgot to use the raw data csv link instead of the link to all of the zombies data). That was definitely a learning curve but I think I understand the correct procedure now.
#4. Trying to work through the last two questions was super discouraging, I wish there was a way to make it so that I don’t have to do the exact same process for each one (there very well could be a way to turn it into a function and I just have no clue). Also I am so confused about what this hint could mean: “Note that for the variables that are not drawn from the normal distribution, you may need to base your estimate of the CIs on slightly different code than for the normal…”
#5. There were so many different parts to answer in the last two questions that I feel like I got lost in answering it. I have no idea how to condense this down into something easier to replicate (maybe a function that I could apply to all the different variables rather than writing each one again). The last one I was going back and forth on using the replicate function vs. the using the numeric 1:100 way (I ended up using this). I was also confused if we are supposed to take a sd for each of the 100 means or calculate one single one at the end
Comments Regarding Changes I Made (Post Peer Commentary):